Some of the basic concepts of information theory are critically
reviewed in the light of a generalized formulation of the theory of
Markoff's chains, in which the initial and final states are sequences
of symbols of different lengths, and the occurrence of symbols is
governed by intersymbol correlation of finite range. In particular,
the conditions of ergodicity and the structure of "ergodic subsets"
of sequences of arbitrary length are carefully discussed. A
mathematical method is developed to determine the "range" and
"strength" of intersymbol correlation. A brief summary of the content
is given at the end of Section 1.


#1.  Introduction

The aim of this paper is to clarify some of the basic, but often
carelessly used, concepts of information theory, viz., the concepts of
ergodicity, intersymbol correlation and redundancy. There are two
approaches to this problem-complex pertaining to probability. One is
an empirical point of view, and probability here is understood in its
statistical aspect. The other is an a priori point of view which
deals with probability mainly in its predictive aspect. In the first
standpoint, the entire population of messages in a language is
supposed to be given, and the various probabilities are calculated
from the actual frequencies of individual symbols or those of
sequences of symbols. According to this method, a unique value of the
probability of appearance of a given symbol or a given sequence can be
statistically determined. In the second point of view, an ensemble
of messages is supposed to be engendered by the given correlation
probabilities starting from a given initial symbol or a given initial
sequence of symbols. In this case, the existence of a unique,
non-vanishing value of the probability of appearance of a given symbol
or a given sequence is not guaranteed, for it may vanish with
increasing length of messages, and it may depend on the initial
condition. Thus, the problem of ergodicity acquires foremost
importance in this approach.

Our Section 2, dealing with the problem of ergodicity, is therefore
developed in the framework of the second point of view. Once the
nature of the ergodicity condition is clarified and this condition is
assumed to be fulfilled, a smooth passage from the second point
of view to the first becomes easy. Thus, our Section 3 on redundancy
can be interpreted in either point of view.

It is not implied by the foregoing paragraphs that the problem
of ergodicity is irrelevant to the first standpoint or cannot be
formulated in the framework of this standpoint. The situation is that
the nucleus of the problem under consideration can be exhibited more
directly and naturally in the second point of view.

The usual theory of Markoff's chains, which is based on transition
probabilities from one state to another, is extended in this paper
to the case where the probability $Q(a_1, \ldots, a_{\nu-1} \mid a_\nu)$
of symbol $a_\nu$ appearing in a message is dependent on the
$(\nu - 1)$ immediately preceding symbols, $\nu$ being the range of
intersymbol correlation. A population of infinitely long messages is
considered to be engendered solely by this intersymbol correlation
probability $Q(a_1, \ldots, a_{\nu-1} \mid a_\nu)$ from a given
$(\nu - 1)$-symbol initial sequence. The problem of ergodicity then
pertains to the existence of a unique (i.e., independent of the
initial sequence), non-vanishing value of $P(a_1, \ldots, a_\mu)$,
which should give the probability that a $\mu$-symbol sequence
arbitrarily taken from the population is $(a_1, \ldots, a_\mu)$,
$\mu$ being not necessarily equal to $\nu$. This generalized problem
of ergodicity is discussed in our Section 2.

It is shown not only that finiteness of the correlation range does
not warrant ergodicity, as is often erroneously assumed in the
existing literature, but also that if $\mu < \nu$ the quantity $P$ can
have more than one finite value depending on the initial sequence, a
situation which does not exist in the ordinary Markoff chains.

Under the conditions that guarantee the existence of a unique (whether
or not non-vanishing) value of $P$, a convenient quantity, called the
correlation index $W_\mu$, defined by Eq. (31), is introduced,
characterizing both the "range" and the "strength" of correlation.
First, it represents the "range", in the sense that the actual
correlation range is the maximum value of $\mu$ for which
$W_\mu \neq 0$. This criterion is of both theoretical and practical
interest. Theoretically, it determines the applicability of the
generalized theory of Markoff's chains, and practically, it can be
used to measure the existing correlation range in a given population
of messages.

Second, this quantity $W_\mu$ represents the "strength" of
correlation, in the sense that $W_\mu$ quantitatively measures the
decrease of information due to the existence of $\mu$-symbol
correlation as compared with the $(\mu - 1)$-symbol correlation.
Finally, the so-called redundancy is expressed in the form of a
compact series in ascending range-numbers of the correlation indices,
Eq. (42).


#2.  Ergodicity 

We assume the alphabet under consideration to consist of $N$ symbols:
$S_1, S_2, \ldots, S_N$. We shall constantly use a mathematical
symbol:

$$Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n), \tag{1}$$

where each one of $a_1, a_2, \ldots, a_n$ can be any one of the $N$
symbols.
Definition I. The quantity denoted by (1) represents the probability
that the last $(n - m)$ symbols of a sequence of $n$ symbols are
$(a_{m+1}, \ldots, a_n)$ when it is known that the first $m$ symbols
of the sequence are $(a_1, \ldots, a_m)$.
By the very nature of probability, we have

$$Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) \ge 0. \tag{2}$$

If there is no correlation between symbols, the probability of
any place in a sequence being occupied by symbol $S_i$ is independent
of the preceding symbols. As a result, the only quantity which
determines a probability of the type (1) is $Q(S_i)$, which represents
the probability of symbol $S_i$ appearing at any one place. In this
case, we have:

$$Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) = Q(a_{m+1})\, Q(a_{m+2}) \cdots Q(a_n).$$

If the correlation extends, for instance, over three consecutive
symbols, and not more than three, then the probability of a place in
a sequence being occupied by symbol $S_k$ will depend on the two
symbols directly preceding it, but not on the symbols beyond these
two. This means that the quantities $Q(S_i, S_j \mid S_k)$ determine
the general probability (1):

$$Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) = Q(a_{m-1}, a_m \mid a_{m+1})\, Q(a_m, a_{m+1} \mid a_{m+2}) \cdots Q(a_{n-2}, a_{n-1} \mid a_n).$$

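In general, we have the following theorem:
Theorem I. If the intersymbol correlation does not extend over more
than $\mu$ consecutive symbols in a sequence, we can factorize (1) as
follows:

$$Q(a_1, \ldots, a_m \mid a_{m+1}, \ldots, a_n) = Q(a_{m-\mu+2}, \ldots, a_m \mid a_{m+1})\, Q(a_{m-\mu+3}, \ldots, a_{m+1} \mid a_{m+2}) \cdots Q(a_{n-\mu+1}, \ldots, a_{n-1} \mid a_n). \tag{3}$$

This theorem can be used to define the "range-number" of intersymbol
correlation: this number $\nu$ is the minimum allowable $\mu$ in
the decomposition (3).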
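
The factorization of Theorem I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `continuation_probability`, the dictionary layout of `q`, and the two-symbol alphabet with its numerical values are all hypothetical and do not come from the paper.

```python
from fractions import Fraction


def continuation_probability(q, mu, first, rest):
    """Q(first | rest) factorized as in (3) for correlation range mu.

    q maps (context, symbol) to a probability, where context is the tuple
    of the mu - 1 symbols immediately preceding symbol (assumed layout).
    """
    if len(first) < mu - 1:
        raise ValueError("need at least mu - 1 known symbols")
    sequence = list(first)
    probability = Fraction(1)
    for symbol in rest:
        # Only the last mu - 1 symbols influence the next one.
        context = tuple(sequence[len(sequence) - (mu - 1):])
        probability *= q.get((context, symbol), Fraction(0))
        sequence.append(symbol)
    return probability


# Hypothetical two-symbol alphabet with range mu = 2 (illustration only).
q = {
    (("A",), "A"): Fraction(1, 4),
    (("A",), "B"): Fraction(3, 4),
    (("B",), "A"): Fraction(1),
    (("B",), "B"): Fraction(0),
}

# In the paper's notation: Q(A | B, A) = Q(A | B) Q(B | A).
example = continuation_probability(q, 2, ["A"], ["B", "A"])
```

With these assumed values, `example` is the product 3/4 times 1.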

Assuming the correlation to be of range $\nu$, we consider all
the possible sequences whose first $(\nu - 1)$ symbols are given to
be, say, $(a_1, a_2, \ldots, a_{\nu-1})$. Among these sequences
starting with $(a_1, a_2, \ldots, a_{\nu-1})$, we inquire the
probability of those sequences whose first $\nu$ symbols are
$(a_1, b_1, b_2, \ldots, b_{\nu-1})$. This probability is obviously
given by

$$R(a_1, a_2, \ldots, a_{\nu-1} \mid b_1, b_2, \ldots, b_{\nu-1}) = Q(a_1, a_2, \ldots, a_{\nu-1} \mid b_{\nu-1})$$

if $(a_2, \ldots, a_{\nu-1}) = (b_1, \ldots, b_{\nu-2})$, and otherwise

$$R(a_1, a_2, \ldots, a_{\nu-1} \mid b_1, b_2, \ldots, b_{\nu-1}) = 0.$$

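In other words, the probability in question can be written in a matrix
form:

$$(a_1, a_2, \ldots, a_{\nu-1} \mid R \mid b_1, b_2, \ldots, b_{\nu-1}) = Q(a_1, \ldots, a_{\nu-1} \mid b_{\nu-1})\, \delta(a_2, b_1)\, \delta(a_3, b_2) \cdots \delta(a_{\nu-1}, b_{\nu-2}), \tag{4}$$

with

$$\delta(S_i, S_j) = 1 \ \text{ if } i = j, \qquad \delta(S_i, S_j) = 0 \ \text{ if } i \neq j.$$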

Using this matrix-expression, the probability, in the above
population of sequences, of a particular sequence
$(b_1, b_2, \ldots, b_{\nu-1})$ appearing in such a position that the
place distance between $a_1$ and $b_1$ is $m$ symbols can be given by

$$T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1}) = (a_1, \ldots, a_{\nu-1} \mid R^m \mid b_1, \ldots, b_{\nu-1}), \tag{5}$$

where $R^m$ simply means the $m$-th power of $R$ in the sense of
matrix multiplication.
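
As a rough sketch of (4) and (5), the matrix $R$ can be assembled from a table of $Q$ values and raised to a power. The helper names `build_r`, `mat_mul` and `t_matrix`, the dictionary layout, and the binary range-3 example are assumptions made for illustration; they are not taken from the paper.

```python
from fractions import Fraction
from itertools import product


def build_r(q, alphabet, nu):
    """Matrix R of (4): rows and columns are (nu - 1)-symbol sequences."""
    states = list(product(alphabet, repeat=nu - 1))
    r = {a: {b: Fraction(0) for b in states} for a in states}
    for a in states:
        for symbol in alphabet:
            # The delta factors in (4) force b to be a shifted by one place.
            b = a[1:] + (symbol,)
            r[a][b] = q.get((a, symbol), Fraction(0))
    return states, r


def mat_mul(x, y, states):
    """Ordinary matrix product of two dict-of-dict matrices."""
    return {a: {b: sum(x[a][c] * y[c][b] for c in states) for b in states}
            for a in states}


def t_matrix(r, states, m):
    """T^(m) of (5): the m-th matrix power of R, for m >= 1."""
    result = r
    for _ in range(m - 1):
        result = mat_mul(result, r, states)
    return result


# Hypothetical binary alphabet with range nu = 3 (illustration only).
half = Fraction(1, 2)
q = {
    ((0, 0), 0): half, ((0, 0), 1): half,
    ((0, 1), 1): Fraction(1),
    ((1, 0), 0): Fraction(1),
    ((1, 1), 0): half, ((1, 1), 1): half,
}
states, r = build_r(q, (0, 1), 3)
t2 = t_matrix(r, states, 2)
```

Each row of `r` sums to one, as (11) later requires, and `t2[(0, 0)][(1, 1)]` is the probability that the pair two places after an initial `(0, 0)` is `(1, 1)`.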

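With the help of the quantity (5), we can further calculate the
probability of a given sequence of any length $(\mu - 1)$, say
$(b_1, \ldots, b_{\mu-1})$, appearing at any position after the
initial $(a_1, \ldots, a_{\nu-1})$. If $\mu > \nu$ this probability
will be

$$T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}) = T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\nu-1})\, Q(b_1, \ldots, b_{\nu-1} \mid b_\nu) \cdots Q(b_{\mu-\nu}, \ldots, b_{\mu-2} \mid b_{\mu-1}), \tag{6}$$

where $m$ stands for the symbol distance between $a_1$ and $b_1$.

If $\mu < \nu$, we have

$$T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}) = \sum_{b_\mu} \cdots \sum_{b_{\nu-1}} T^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}, b_\mu, \ldots, b_{\nu-1}), \tag{7}$$

where $m$ bears the same meaning.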

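Now, the average probability of sequence $(b_1, \ldots, b_{\mu-1})$
with the "place-distance" not larger than $m$ will be

$$U^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}) = \frac{1}{m} \sum_{k=1}^{m} T^{(k)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}). \tag{8}$$
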
We now proceed to define what we mean by ergodicity in this
paper. We consider all the possible, infinitely long sequences which
start with a given initial sequence $(a_1, \ldots, a_{\nu-1})$ and ask
the average probability of the sequence $(b_1, \ldots, b_{\mu-1})$
appearing in any position. This probability evidently has the
mathematical expression:

$$\lim_{m \to \infty} U^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}). \tag{9}$$

The word average here implies a two-fold averaging, viz., first,
averaging over all the possible sequences with a fixed position
where the sequence $(b_1, \ldots, b_{\mu-1})$ should appear, and
second, averaging over all the possible positions of this sequence.
The first averaging is mathematically represented by the matrix
multiplication in (5), and the second averaging by the summation
in (8).
Definition II. If $\lim_{m \to \infty} U^{(m)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1})$
converges to a unique, non-vanishing limit independent of
$(a_1, \ldots, a_{\nu-1})$, where $(a_1, \ldots, a_{\nu-1})$ can be
taken arbitrarily from a certain family of $(\nu - 1)$-symbol
sequences and $(b_1, \ldots, b_{\mu-1})$ can be taken arbitrarily from
a certain family of $(\mu - 1)$-symbol sequences, then we speak of
ergodicity with regard to these families.

We shall presently see that the quantity (9) with a fixed initial
sequence $(a_1, \ldots, a_{\nu-1})$ and a fixed final sequence
$(b_1, \ldots, b_{\mu-1})$ indeed converges to a limit, say:

$$U^{(\infty)}(a_1, \ldots, a_{\nu-1} \mid b_1, \ldots, b_{\mu-1}), \tag{10}$$

but this limit is not necessarily larger than zero, nor is it in
general necessarily independent of the initial sequence. In order to
understand the situation clearly, let us invoke some well-known
mathematical theorems regarding the Markoff chains.

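The ordinary Markoff chain formally pertains to a two-symbol
correlation probability $(\alpha \mid R \mid \beta)$,
$(\alpha, \beta = 1, 2, \ldots, M)$:

$$(\alpha \mid R \mid \beta) \ge 0; \qquad \sum_{\beta} (\alpha \mid R \mid \beta) = 1. \tag{11}$$

In accordance with the usual rule of matrix multiplication, we
further introduce

$$(\alpha \mid R^m \mid \beta) = \sum_{\gamma} (\alpha \mid R^{m-1} \mid \gamma)\, (\gamma \mid R \mid \beta). \tag{12}$$

Then, we have the following theorems: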

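Theorem II. The quantity defined by

$$U^{(m)}(\alpha \mid \beta) = \frac{1}{m} \sum_{k=1}^{m} (\alpha \mid R^k \mid \beta) \tag{13}$$

for any given pair $(\alpha, \beta)$ converges to a limit as
$m \to \infty$:

$$\lim_{m \to \infty} U^{(m)}(\alpha \mid \beta) = U^{(\infty)}(\alpha \mid \beta). \tag{14}$$


1. See for instance W. Feller, An Introduction to Probability Theory
and its Applications (John Wiley, New York, 1950) p. 307 ff.
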

Theorem III. The entire set $G$ of symbols
$(\alpha = 1, 2, \ldots, M)$ can be divided into a "vanishing" subset
$V$ and a certain number of "closed" subsets $C_i$ $(i = 1, 2, \ldots)$
in such a way that

$U^{(\infty)}(\alpha \mid \beta) = 0$ for $\alpha$ belonging to $G$ and for $\beta$ belonging to $V$,
$U^{(\infty)}(\alpha \mid \beta) > 0$ for $\alpha$ and $\beta$ belonging to the same $C_i$,
$U^{(\infty)}(\alpha \mid \beta) = 0$ for $\alpha$ and $\beta$ belonging to different $C$'s.
Theorem IV. $U^{(\infty)}(\alpha \mid \beta)$ is independent of
$\alpha$, if $\alpha$ and $\beta$ belong to the same $C$.
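
The division of Theorem III can be found mechanically from the pattern of non-zero entries of $R$. The sketch below uses a plain reachability argument (a state lies in a closed subset when every state reachable from it can return to it); the function names and the four-state example matrix are hypothetical and serve only as an illustration.

```python
def reachable(r, start):
    """States reachable from start (start included) along non-zero entries of r."""
    seen, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for b, p in r[a].items():
            if p > 0 and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def classify(r):
    """Return (closed_subsets, vanishing_subset) in the sense of Theorem III."""
    reach = {a: reachable(r, a) for a in r}
    closed, vanishing = [], set()
    for a in r:
        # a belongs to a closed subset iff every state reachable from a leads back to a.
        if all(a in reach[b] for b in reach[a]):
            if not any(a in c for c in closed):
                closed.append(frozenset(reach[a]))
        else:
            vanishing.add(a)
    return closed, vanishing


# Hypothetical four-state chain: 0 is absorbing, 1 and 2 alternate, 3 leaks away.
r = {
    0: {0: 1.0},
    1: {2: 1.0},
    2: {1: 1.0},
    3: {0: 0.5, 1: 0.5},
}
closed, vanishing = classify(r)
```

For this assumed matrix the closed subsets are `{0}` and `{1, 2}`, and the vanishing subset is `{3}`.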

Coming back to our original topic, if the correlation-range is
two, and if $\mu = \nu$, these theorems can be directly applied to our
problem involved in Definition II. If the correlation-range is $> 2$,
we only need to consider a sequence of $(\nu - 1)$ symbols
collectively as a symbol $\alpha$. The $R$'s defined in (4) indeed
satisfy (11). The cases $\mu \neq \nu$ can be handled with the help
of (6) and (7). From Theorem II follows quite generally:
Theorem V. The limit (10) exists.

We shall now discuss first the case $\mu = \nu$ in the light of
Theorems II, III and IV. According to Theorem III, the entire set
of $(\nu - 1)$-symbol sequences is subdivided into a vanishing subset
$V$ and a certain number of closed subsets $C_i$. If the final
sequence of (10) belongs to $V$, then $U^{(\infty)}$ is zero
independently of the initial sequence. For a given final sequence
belonging to one of the closed subsets, $U^{(\infty)}$ will be zero if
the initial sequence belongs to another closed subset, and will have a
constant non-vanishing value insofar as the initial sequence belongs
to the same closed subset as the final sequence. Thus:

Theorem VI. When $\mu = \nu$, ergodicity in the sense of Def. II holds
if and only if the initial family and the final family are the same
closed subset.

In the cases where $\mu > \nu$, we construct an "extended" closed
subset $D_i$ of $(\mu - 1)$ symbols by taking those
$(\mu - 1)$-symbol sequences $(b_1, \ldots, b_{\mu-1})$ whose first
$(\nu - 1)$ symbols coincide with one of the members of the
$(\nu - 1)$-symbol closed subset $C_i$ and which satisfy the
condition:

$$Q(b_1, \ldots, b_{\nu-1} \mid b_\nu)\, Q(b_2, \ldots, b_\nu \mid b_{\nu+1}) \cdots Q(b_{\mu-\nu}, \ldots, b_{\mu-2} \mid b_{\mu-1}) \neq 0. \tag{15}$$

The extended vanishing subset will be composed of all those
$(\mu - 1)$-symbol sequences whose first $(\nu - 1)$ symbols coincide
with one of the members of the $(\nu - 1)$-symbol vanishing subset, or
whose first $(\nu - 1)$ symbols coincide with one of the members of
some closed subset but whose last $(\mu - \nu)$ symbols violate the
condition (15).
The entire set of possible $(\mu - 1)$-symbol sequences is thus
covered by the $D$'s and $V$, and there is no possible overlapping.
If the $(\mu - 1)$-symbol final sequence of (10) is a member of this
extended vanishing subset, $U^{(\infty)}$ will certainly vanish
whatever the initial sequence may be. If the final sequence belongs
to an extended closed subset $D_i$, then $U^{(\infty)}$ will vanish
for an initial sequence belonging to a $C_j$ different from the one,
$C_i$, which corresponds to $D_i$, and will have a constant
non-vanishing value for any initial sequence belonging to $C_i$.

Theorem VII. When $\mu > \nu$, ergodicity holds if and only if the
initial family is one of the closed subsets $C_i$ and the final family
is the extended closed subset $D_i$ corresponding to $C_i$.
In the cases where $\mu < \nu$, we encounter a rather peculiar
situation. From a closed subset $C_i$ we construct a retrenched
subset $E_i$ of $(\mu - 1)$-symbol sequences. $E_i$ is the set of
those $(\mu - 1)$-symbol sequences which coincide with the first
$(\mu - 1)$ symbols of at least one of the members of $C_i$. The
retrenched vanishing subset is defined as the totality of all those
$(\mu - 1)$-symbol sequences which do not belong to any one of the
retrenched closed subsets. In the case of the extended closed
subsets, a given sequence of $(\mu - 1)$ symbols could not belong to
more than one $D_i$, since the division made in Theorem III does not
allow for any overlapping. However, in the present case of retrenched
subsets, a given $(\mu - 1)$-symbol sequence may well belong to more
than one $E$. If the $(\mu - 1)$-symbol final sequence of (10)
belongs to the retrenched vanishing subset, $U^{(\infty)}$ will always
vanish. If the $(\mu - 1)$-symbol final sequence belongs to
$E_i, E_j, \ldots, E_k$, then $U^{(\infty)}$ will be zero for an
initial sequence belonging to a $C$ different from any one of the
corresponding subsets $C_i, C_j, \ldots, C_k$. For the same final
sequence, $U^{(\infty)}$ may thus have different non-vanishing values
according as to which one of $C_i, C_j, \ldots, C_k$ the initial
sequence belongs.
Theorem VIII. When $\mu < \nu$, ergodicity holds for the initial
family identical with one of the closed subsets $C_i$ and the
final family identical with the corresponding retrenched
subset $E_i$.

In the foregoing considerations, we have systematically omitted
the initial sequences belonging to the vanishing subset $V$. The
reason for this is that $U^{(\infty)}$ depends in this case on the
detailed structure of the intersymbol correlation, and that we cannot
draw a conclusion of general validity. (Of course, if the final
sequence also belongs to $V$, then $U^{(\infty)}$ vanishes.)

Regarding the closed subsets of $(\nu - 1)$ symbols, we should like
to mention the following interesting property. We have obviously

whence we infer:

Theorem IX. $(b_2, b_3, \ldots, b_\nu)$ is a member of $C_i$, if there
is any symbol $b_1$ such that $(b_1, b_2, \ldots, b_{\nu-1})$ is a
member of $C_i$ and $Q(b_1, b_2, \ldots, b_{\nu-1} \mid b_\nu) \neq 0$.
For a given $(b_1, b_2, \ldots, b_{\nu-1})$ there must be at least one
$b_\nu$ such that $Q(b_1, b_2, \ldots, b_{\nu-1} \mid b_\nu) \neq 0$,
on account of (2). Hence:
Theorem X. If $(b_1, b_2, \ldots, b_{\nu-1})$ is a member of $C_i$,
then there is always a member of $C_i$ whose first $(\nu - 2)$ symbols
are $(b_2, \ldots, b_{\nu-1})$.

Before closing this section, a simple illustration may be given.
Suppose the alphabet to be composed of three symbols, $S_1$, $S_2$
and $S_3$, and to have an intersymbol correlation of range 3:


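The entries of $Q$ that can be recovered from this copy are:

    $Q(S_1, S_1 \mid S_1) = 1$,
    $Q(S_1, S_2 \mid S_1) = 1$,
    $Q(S_2, S_1 \mid S_2) = 1$,
    $Q(S_2, S_2 \mid S_2) = 1$,
    $Q(S_3, S_3 \mid S_1) = 1$.

Four further entries, one for each remaining pair containing $S_3$,
are listed in the original but are not legible here.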

Then the $(\nu - 1)$-symbol subsets are:

    $C_1$ :  $(S_1, S_1)$
    $C_2$ :  $(S_1, S_2)$, $(S_2, S_1)$
    $C_3$ :  $(S_2, S_2)$
    $V$   :  $(S_1, S_3)$, $(S_3, S_1)$, $(S_2, S_3)$, $(S_3, S_2)$, $(S_3, S_3)$


The extended 3-symbol subsets are:

    $D_1$ :  $(S_1, S_1, S_1)$
    $D_2$ :  $(S_1, S_2, S_1)$, $(S_2, S_1, S_2)$
    $D_3$ :  $(S_2, S_2, S_2)$
    $V$   :  all other 3-symbol sequences


The retrenched 1-symbol subsets are:

    $E_1$ :  $S_1$
    $E_2$ :  $S_1$, $S_2$
    $E_3$ :  $S_2$
    $V$   :  $S_3$


We can see the overlapping we have discussed; as a result,
$U^{(\infty)}$ with the final sequence (symbol) $S_1$, for instance,
becomes three-valued:

    $U^{(\infty)}(S_1, S_1 \mid S_1) = 1$
    $U^{(\infty)}(S_1, S_2 \mid S_1) = 1/2$
    $U^{(\infty)}(S_2, S_1 \mid S_1) = 1/2$
    $U^{(\infty)}(S_2, S_2 \mid S_1) = 0$
    All other $U^{(\infty)}(\,\cdot\,, \cdot \mid S_1) = 1$
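
The values of $U^{(\infty)}$ for initial sequences in the closed subsets can be checked numerically. The sketch below evaluates the finite average (13), combined with the summation (7), for the three-symbol illustration. The transitions out of the four closed-subset pairs follow the text; the transitions out of the pairs containing $S_3$ are ASSUMED here purely so that the table is complete, and they do not affect the four values computed. The names `NEXT` and `u_average` are hypothetical.

```python
from fractions import Fraction
from itertools import product

ALPHABET = (1, 2, 3)  # stands for S_1, S_2, S_3

# Symbol that follows each pair; every transition here has probability 1.
NEXT = {
    # closed subsets C_1, C_2, C_3, as in the text
    (1, 1): 1, (1, 2): 1, (2, 1): 2, (2, 2): 2,
    # pairs containing S_3: ASSUMED values, for illustration only
    (1, 3): 1, (2, 3): 1, (3, 1): 1, (3, 2): 3, (3, 3): 1,
}

STATES = list(product(ALPHABET, repeat=2))


def u_average(initial, final_symbol, m):
    """U^(m)(initial | final_symbol): mean over k = 1..m of T^(k), using (7)."""
    dist = {s: Fraction(0) for s in STATES}
    dist[initial] = Fraction(1)
    total = Fraction(0)
    for _ in range(m):
        new = {s: Fraction(0) for s in STATES}
        for (x, y), p in dist.items():
            new[(y, NEXT[(x, y)])] += p   # one application of R
        dist = new
        # (7): sum over the unspecified second symbol of the pair
        total += sum(p for (b1, _b2), p in dist.items() if b1 == final_symbol)
    return total / m


values = {pair: u_average(pair, 1, 1000)
          for pair in [(1, 1), (1, 2), (2, 1), (2, 2)]}
```

For an even number of steps the averages for the two members of $C_2$ are exactly one half, because the sequence alternates between $(S_1, S_2)$ and $(S_2, S_1)$.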
#3.  Redundancy

In this section, we shall constantly use a quantity denoted by:

$$P(a_1, a_2, \ldots, a_n) \ge 0. \tag{17}$$

Definition III. The quantity (17) represents the probability, in
infinitely long messages, of an arbitrarily taken sequence of
symbol-length $n$ being a particular sequence $(a_1, a_2, \ldots, a_n)$.
From this definition follows the normalization condition:

$$\sum_{a_1} \cdots \sum_{a_n} P(a_1, a_2, \ldots, a_n) = 1. \tag{18}$$

According to the point of view of the last section, the existence
of a unique value of such a probability is not unconditionally
guaranteed. Only if the initial sequence $(b_1, \ldots, b_{\nu-1})$
is limited to within a closed subset, say, $C_i$, then

$$U^{(\infty)}(b_1, \ldots, b_{\nu-1} \mid a_1, \ldots, a_n)$$

becomes independent of $(b_1, \ldots, b_{\nu-1})$, i.e., a function
only of $(a_1, \ldots, a_n)$. If this is the case, we can write

$$U^{(\infty)}(b_1, \ldots, b_{\nu-1} \mid a_1, \ldots, a_n) = P(a_1, \ldots, a_n). \tag{19}$$

According to the theorems of the last section, if $(a_1, \ldots, a_n)$
belongs to $C_i$, or its extended subset $D_i$, or its retrenched
subset $E_i$, $P$ will be finite, and otherwise zero. We have
therefore to restrict the "infinitely long messages" of Definition III
to only those which start with initial sequences belonging to one
closed subset. The condition regarding $P$ does not require that all
the $P$'s should be non-vanishing, thence the restriction on the final
sequences, in the sense of Definition II, is not necessary. On
account of ergodicity, two sequences starting from two different
initial sequences of the same closed subset become, in the long run,
statistically identical. It is true that we can evade the restriction
on the initial sequences by giving a certain "weight" to each of the
closed subsets, which would lead to a unique value of each $P$.
However, from the point of view that the messages are engendered
solely by the correlation probability, this alternative is not
acceptable, since it involves an arbitrary "weight" of each closed
subset. Our discussion of this section will be based on the
assumption that the initial sequences are limited to a single subset.
The generalization of the results to the case of "weighted" subsets
is very simple.

It should be noted that, as a result of the limitation of the
initial sequences to a single subset, it may well happen that some of
the generally possible sequences $(a_1, \ldots, a_{\nu-1})$ in the
correlation probability $Q(a_1, \ldots, a_{\nu-1} \mid a_\nu)$
actually never happen in the possible messages. Thus the actual range
of correlation may become smaller than the range defined with regard
to the entire possibilities of the $a$'s. For instance, in the
illustration of the last section, if we limit ourselves to the initial
subset $C_2$, all 3-symbol $Q$'s except $Q(S_1, S_2 \mid S_1) = 1$
and $Q(S_2, S_1 \mid S_2) = 1$ will become meaningless. These two
3-symbol correlation probabilities reduce to the following two
2-symbol correlation probabilities: $Q(S_1 \mid S_2) = 1$ and
$Q(S_2 \mid S_1) = 1$. The range is thus reduced from three to two.

In the empirical point of view, if a population of very long sample
messages is given, we can always evaluate (17) by just counting the
frequency of each segment $(a_1, \ldots, a_n)$. However, if we divide
this entire population into, say, two groups, the values of (17) may
be different in the two groups. This discrepancy may be caused by a
difference in correlation probabilities and/or by a difference in the
initial sequences. We thus see that the problem of ergodicity is not
irrelevant to the empirical point of view. In this section, however,
we assume that we have a single population from which the quantities
of the type (17) are uniquely determined.

The quantity (17) has, besides (18), the property:

$$\sum_{a_1} \cdots \sum_{a_k} \sum_{a_{k+m+1}} \cdots \sum_{a_n} P(a_1, \ldots, a_k, b_1, \ldots, b_m, a_{k+m+1}, \ldots, a_n) = P(b_1, \ldots, b_m). \tag{20}$$

This is obvious from the statistical point of view, but can also be
verified from the standpoint of (19).

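According to (6), we have for $n \ge \nu$

$$P(a_1, \ldots, a_n) = P(a_1, \ldots, a_{\nu-1})\, Q(a_1, \ldots, a_{\nu-1} \mid a_\nu) \cdots Q(a_{n-\nu+1}, \ldots, a_{n-1} \mid a_n), \tag{21}$$

or more generally,

$$P(a_1, \ldots, a_n) = P(a_1, \ldots, a_{\mu-1})\, Q(a_1, \ldots, a_{\mu-1} \mid a_\mu) \cdots Q(a_{n-\mu+1}, \ldots, a_{n-1} \mid a_n), \tag{22}$$

provided $n \ge \mu \ge \nu$. Equivalence of (21) and (22) can readily
be seen with the help of (3) and (6). In particular, for
$n = \mu \ge \nu$, we get from (22)

$$Q(a_1, \ldots, a_{\mu-1} \mid a_\mu) = \frac{P(a_1, \ldots, a_\mu)}{P(a_1, \ldots, a_{\mu-1})}. \tag{23}$$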

This is just what should be according to Definitions I and III.
(23) may be considered as the definition of
$Q(a_1, \ldots, a_{\mu-1} \mid a_\mu)$ even for $\mu < \nu$. However,
with such $Q$'s with $\mu < \nu$, (22) will not be true, since the
$Q$'s with $\mu < \nu$ cannot describe fully the existing correlation.
Substituting (23) into (22), we get

$$P(a_1, \ldots, a_n) = \frac{P(a_1, \ldots, a_\mu)\, P(a_2, \ldots, a_{\mu+1}) \cdots P(a_{n-\mu+1}, \ldots, a_n)}{P(a_2, \ldots, a_\mu) \cdots P(a_{n-\mu+1}, \ldots, a_{n-1})}, \tag{24}$$

provided $n \ge \mu \ge \nu$. The actual range $\nu$ is thus the
minimum value of $\mu$ for which the decomposition (24) is allowed.

For an allowed value of $\mu$, if a further decomposition of range
$\mu - 1$ is still allowed, i.e., if $\mu - 1 \ge \nu$, then we get
from (24)

$$P(a_1, \ldots, a_\mu) = \frac{P(a_1, \ldots, a_{\mu-1})\, P(a_2, \ldots, a_\mu)}{P(a_2, \ldots, a_{\mu-1})} \tag{25}$$

for all $(a_1, \ldots, a_\mu)$. But if $\mu - 1 < \nu$, the left side
of (25) will not be equal to its right side for at least one sequence
$(a_1, \ldots, a_\mu)$. Thus we are led to use (25) as a criterion to
determine whether $\mu > \nu$ or not: if (25) holds for all
$(a_1, \ldots, a_\mu)$, then $\mu > \nu$; if not, $\mu \le \nu$.
Indeed, if (25) is possible, we have in virtue of (23):

$$Q(a_1, \ldots, a_{\mu-1} \mid a_\mu) = \frac{P(a_1, \ldots, a_\mu)}{P(a_1, \ldots, a_{\mu-1})} = \frac{P(a_2, \ldots, a_\mu)}{P(a_2, \ldots, a_{\mu-1})}, \tag{26}$$

i.e., $Q$ of range $\mu$ is reducible to a $Q$ of range $(\mu - 1)$.
In the light of Theorem I, this means that the actual range is
$(\mu - 1)$ or less. If (25) breaks down for at least one sequence
$(a_1, \ldots, a_\mu)$, then (26) does not hold in general, meaning
that the actual range is larger than $(\mu - 1)$.

Theorem XI. If and only if (25) holds for all $(a_1, \ldots, a_\mu)$,
the actual correlation range $\nu$ is $(\mu - 1)$ or less.

This criterion is interesting particularly in the empirical point
of view, for here the $P$'s, instead of the $Q$'s, are the quantities
which are primarily given. The criterion of Theorem XI can be brought
to a more concise form by the help of the well-known theorem
attributed to W. Gibbs:

Theorem XII. If

$$f_i \ge 0, \quad g_i \ge 0, \quad \text{and} \quad \sum_i f_i = \sum_i g_i, \quad (i = 1, 2, \ldots, r), \tag{27}$$

then

$$\sum_i f_i \log f_i - \sum_i f_i \log g_i \ge 0, \tag{28}$$

where the equality holds only when $f_i = g_i$ for all $i$.
Now, let us call the left-hand side and the right-hand side of
(25), respectively,

$$f(a_1, \ldots, a_\mu) = P(a_1, \ldots, a_\mu), \tag{29}$$

$$g(a_1, \ldots, a_\mu) = \frac{P(a_1, \ldots, a_{\mu-1})\, P(a_2, \ldots, a_\mu)}{P(a_2, \ldots, a_{\mu-1})}, \tag{30}$$

and consider the index $i$ of Theorem XII as a collective index for
various possible sequences of symbol-length $\mu$. On account of (18)
and (20), the conditions (27) are satisfied, and we obtain

$$W_\mu \equiv \sum P(a_1, \ldots, a_\mu) \log P(a_1, \ldots, a_\mu) - \sum P(a_1, \ldots, a_\mu) \log g(a_1, \ldots, a_\mu) \ge 0. \tag{31}$$

Only when (25) holds for all $(a_1, \ldots, a_\mu)$, then
$W_\mu = 0$. In other words, for a given value of $\nu$, $W_\mu = 0$
for $\mu > \nu$. This leads to a convenient way to determine the
actual range:

Theorem XIII. The actual range $\nu$ is the maximum value of $\mu$
for which $W_\mu \neq 0$.
The $W$'s defined by (31) will be called "correlation indices".

For $\mu = 2$, the definition of $W_\mu$ in (31) should be understood
as meaning

$$W_2 = \sum P(a_1, a_2) \log P(a_1, a_2) - 2 \sum P(a_1) \log P(a_1), \tag{32}$$

for we have here $g(a_1, a_2) = P(a_1)\, P(a_2)$.
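
In the empirical point of view, the correlation indices can be estimated directly from block frequencies. The following sketch evaluates (31) and (32) on a sample message, reading the sample cyclically so that the marginal property (20) holds exactly for the counted frequencies. The function names, the cyclic convention and the sample itself are assumptions made for illustration.

```python
import math
from collections import Counter


def block_probs(sample, length):
    """Relative frequencies of blocks of the given length, read cyclically."""
    n = len(sample)
    doubled = sample + sample
    counts = Counter(tuple(doubled[i:i + length]) for i in range(n))
    return {block: c / n for block, c in counts.items()}


def correlation_index(sample, mu):
    """W_mu of (31), or of (32) when mu = 2, in natural-log units."""
    p_mu = block_probs(sample, mu)
    p_less = block_probs(sample, mu - 1)
    p_core = block_probs(sample, mu - 2) if mu > 2 else None
    w = 0.0
    for block, p in p_mu.items():
        # g of (30); for mu = 2 the denominator is absent, as in (32)
        g = p_less[block[:-1]] * p_less[block[1:]]
        if p_core is not None:
            g /= p_core[block[1:-1]]
        w += p * math.log(p / g)
    return w


# Alternating sample, standing for ... S_1 S_2 S_1 S_2 ...
sample = list("AB" * 50)
w2 = correlation_index(sample, 2)
w3 = correlation_index(sample, 3)
```

For this sample `w2` equals $\log 2$ and `w3` vanishes, so by Theorem XIII the estimated range is two.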

We shall now proceed to find out the average amount of information
carried by a message-segment of length $n$ in a language in which the
$P$'s exist. A specific message-segment $(a_1, \ldots, a_n)$ has
probability $P(a_1, \ldots, a_n)$. Thus the information per symbol
carried by this message-segment is

$$-\frac{1}{n} \log P(a_1, \ldots, a_n).$$

The probability of occurrence of such a message being
$P(a_1, \ldots, a_n)$, the average information per symbol for various
possible message-segments of length $n$ is given by

$$I_n = -\frac{1}{n} \sum P(a_1, \ldots, a_n) \log P(a_1, \ldots, a_n). \tag{33}$$

Now, if the existing correlation is of range $\nu$, the $P$ can be
decomposed as in (24) with $\mu = \nu$. A straightforward calculation
with the help of (18) and (20) gives

$$I_{n,\nu} = -\frac{1}{n}(n - \nu + 1) \sum P(a_1, \ldots, a_\nu) \log P(a_1, \ldots, a_\nu) + \frac{1}{n}(n - \nu) \sum P(a_1, \ldots, a_{\nu-1}) \log P(a_1, \ldots, a_{\nu-1}). \tag{34}$$

For an obvious reason this $\nu$ can be the actual minimum range or
any $\nu$ that is larger than this. Supposing $\nu$ in (34) to be the
actual minimum range, let us find the error which would be committed
by the calculation based on the assumption that the actual range were
$\nu - 1$. This is easily found to be

$$I_{n,\nu} - I_{n,\nu-1} = -\frac{n - \nu + 1}{n}\, W_\nu. \tag{35}$$

Repeating this process, we obtain

$$I_n - I^0 = I_{n,\nu} - I^0 = -\frac{1}{n} \sum_{\mu=2}^{\nu} (n - \mu + 1)\, W_\mu, \tag{36}$$

where

$$I^0 = I_{n,1} = -\sum P(a_1) \log P(a_1). \tag{37}$$
Since $W_\mu$ vanishes anyway for $\mu > \nu$, we can state:

Theorem XIV. The average information per symbol carried by a
message-segment of length $n$ is

$$I_n = I^0 - \frac{1}{n} \sum_{\mu=2}^{n} (n - \mu + 1)\, W_\mu, \tag{38}$$

insofar as $n$ is larger than the actual correlation range.
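
Theorem XIV can be checked numerically on a small example. The sketch below takes a hypothetical two-symbol chain of range two (the matrix `R` and its stationary distribution `PI` are assumed values, not taken from the paper), computes the block probabilities exactly, and compares $I_n$ from (33) with the right-hand side of (36). It writes $W_\mu$ through block entropies, a form that follows from (31) by applying (20) to each logarithm.

```python
import math
from itertools import product

# Hypothetical two-symbol chain of range 2 (an ordinary Markoff chain).
R = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
PI = {0: 0.8, 1: 0.2}  # stationary for R: 0.8 * 0.9 + 0.2 * 0.4 = 0.8


def block_prob(block):
    """P(a_1, ..., a_k) for the stationary chain."""
    p = PI[block[0]]
    for a, b in zip(block, block[1:]):
        p *= R[a][b]
    return p


def block_entropy(k):
    """-sum P log P over all blocks of length k (taken as 0 for k = 0)."""
    if k == 0:
        return 0.0
    total = 0.0
    for block in product((0, 1), repeat=k):
        p = block_prob(block)
        if p > 0:
            total -= p * math.log(p)
    return total


def w(mu):
    """W_mu expressed through block entropies; follows from (31) using (20)."""
    return 2 * block_entropy(mu - 1) - block_entropy(mu) - block_entropy(mu - 2)


def info_per_symbol(n):
    """I_n of (33)."""
    return block_entropy(n) / n


n = 6
lhs = info_per_symbol(n)
rhs = block_entropy(1) - (n - 1) / n * w(2)  # (36) with nu = 2
```

Here `w(3)` vanishes to rounding error, as expected for range two, and `lhs` agrees with `rhs`.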


Since the $W$'s are zero or positive, the intersymbol correlation
tends to decrease the amount of information. Thus, $W_\mu$ can be
considered to represent the "strength" of correlation, strength in
the sense of reducing the amount of information. By definition, $I_n$
cannot be negative, thence there is an upper limit to the total
"strength" of the correlations:

$$\frac{1}{n} \sum_{\mu=2}^{n} (n - \mu + 1)\, W_\mu \le I^0. \tag{39}$$

For $n \gg \nu$, we obtain from (38),

$$I_n \approx I_\infty = I^0 - \sum_{\mu} W_\mu \qquad (n \gg \nu), \tag{40}$$

showing that if we take a sufficiently long segment as a unit, the
information per symbol becomes independent of the length of the
segment. This indirectly justifies the usual procedure according to
which an infinitely long message is cut into segments of sufficient
length and the segments are treated as if they did not have any
correlation among them.

The quantity called "redundancy" is defined by (ref. 2)

$$R = \frac{I^0 - I_\infty}{I^0}. \tag{41}$$

Theorem XV. The redundancy of a language which is characterized
by the correlation indices $W_\mu$ is given by

$$R = \frac{1}{I^0} \sum_{\mu} W_\mu, \qquad 0 \le R \le 1. \tag{42}$$

In the illustration of the last section, if we limit the initial
sequences to $C_2$, we get

$$W_2 = \log 2, \qquad W_3 = W_4 = \cdots = 0,$$
$$I^0 = \log 2, \qquad I_\infty = 0, \qquad R = 100\,\%.$$

This last result is not surprising, because the possible infinite
sequences are limited to $\ldots S_1 S_2 S_1 S_2 \ldots$, which
certainly cannot convey any information.
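
The redundancy of (42) can be estimated from a sample in the same empirical spirit. The sketch below is illustrative only: the function names, the cyclic reading of the sample, the truncation of the sum at `max_mu`, and the two sample strings are all assumptions. It uses the entropy form of $W_\mu$ that follows from (31) and (20).

```python
import math
from collections import Counter


def block_entropy(sample, k):
    """-sum P log P over cyclically counted blocks of length k (0 for k = 0)."""
    if k == 0:
        return 0.0
    n = len(sample)
    doubled = sample + sample
    counts = Counter(tuple(doubled[i:i + k]) for i in range(n))
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def redundancy(sample, max_mu):
    """R of (42), with the sum over W_mu truncated at max_mu (an assumption)."""
    h = [block_entropy(sample, k) for k in range(max_mu + 1)]
    w = [2 * h[mu - 1] - h[mu] - h[mu - 2] for mu in range(2, max_mu + 1)]
    return sum(w) / h[1]


# The alternating message of the illustration, ... S_1 S_2 S_1 S_2 ...
alternating = list("AB" * 20)
r_alternating = redundancy(alternating, 4)

# A hypothetical period-4 message whose pair frequencies look uncorrelated.
period4 = list("AABB")
r_pairs_only = redundancy(period4, 2)
r_with_triples = redundancy(period4, 3)
```

For the alternating message the estimate is 1, in agreement with the 100 % found above. The second sample shows why the truncation matters: with `max_mu = 2` the estimate is 0, whereas including $W_3$ brings it to 1.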


2. Stanford Goldman, Information Theory (Prentice-Hall, New York, 1953)
p. 45.